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ABSTRACT 

Transmission spectroscopy, which consists of measuring the wavelength-dependent 
absorption of starlight by a planet's atmosphere during a transit, is a powerful probe 
of atmospheric composition. However, the expected signal is typically orders of mag- 
nitude smaller than instrumental systematics, and the results are crucially dependent 
on the treatment of the latter. In this paper, we propose a new method to infer transit 
parameters in the presence of systematic noise using Gaussian processes, a technique 
widely used in the machine learning community for Bayesian regression and classifi- 
cation problems. Our method makes use of auxiliary information about the state of 
the instrument, but does so in a non-parametric manner, without imposing a specific 
dependence of the systematics on the instrumental parameters, and naturally allows 
for the correlated nature of the noise. We give an example application of the method 
to archival NICMOS transmission spectroscopy of the hot Jupiter HD 189733, which 
goes some way towards reconciling the controversy surrounding this dataset in the 
literature. Finally, we provide an appendix giving a general introduction to Gaussian 
processes for regression, in order to encourage their application to a wider range of 
problems. 

Key words: methods: data analysis, stars: individual (HD 189733), planetary sys- 
tems, techniques: spectroscopic, techniques: Gaussian processes 



1 INTRODUCTION 

Transiting exoplanets offer a unique opportunity to study 
the structures and atmospheres of planets outside our Solar 
System. When a planet passes in front of its parent star as 
we view it from Earth, it blocks a proportion of starlight 
which we can use to measure the planet's radius relative to 
its host. Transmission spectroscopy exploits the fact that 
the measured radius of a planet is wavelength dependent, 
as the effective size of a planet's occulting disk depends on 
the height in the atmosphere at which the planet becomes 
opaque to starlight, which in turn depends on the atomic 



and molecular species present in the atmosphere (e.g. Seager 
& Sasselov 2000 Brown 20011. Thus measuring planetary 



transits (i.e. the planet-to-star radius ratio) as a function 
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of wavelength allows us to probe the makeup of a planet's 
atmosphere. 

Observations of transmission spectra, in particular us- 
ing the Hubble Space Telescope (HST) and Spitzer Space 
Telescope (SST) have provided some of the most detailed ob- 



servations of exoplanet atmospheres to date (see e.g. 


Char- 


bonneau et al.|2002| Vidal-Madjar et al.|2003||Charbonneau 


et al.|2005 


Deming et al.|2005l. However, extracting 


trans- 



mission spectra from these data sets is by no means trivial, 
and has led to some discussion regarding the interpretation 
of these signals. Problems arise because the transmission 
signal is often dwarfed by instrumental systematics in the 
datasets. These are caused by imperfect observing condi- 
tions, such as changes in the pointing, detector temperature, 
optics and pixel-to-pixel sensitivity, and result in correlated 
noise in the light curves. This has led to the development of 
some novel techniques to correct and remove instrumental 
systematics, usually by modelling them as a deterministic 
function of auxiliary measurements from the dataset (see 
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e.g.|Brown et al.|2001||Gilliland fc Arribas|2003l |Pont et al.| 



20071. Often however, the choice of systematics model will 
significantly affect the transmission spectra, and therefore 
the detection of atmospheric species. Attempting to develop 
a more general and robust framework to infer transit pa- 
rameters in the presence of instrumental systematics and 
correlated noise provides the motivation for this work. 

The problem of correlated noise for observations of tran- 
siting exoplanets was first raised by |Pont et al.| p006), who 
provided methods to estimate its effects on time series anal- 
ysis, and take it into account when inferring physical proper- 
ties of exoplanets from such data sets. Indeed, the problems 
associated with systematics appear in many different areas 
of exoplanet observations, including transit timing analysis 
(e.g. Gibson et al.||2009| ) and radial velocity measurements 
(e.g. " 



Pont et al.||2011[ ). Notably, |Carter fc Winn 



veloped a fast and powerful wavelet based method to model 
and remove time-correlated noise from transit light curves, 
without the need for additional inputs. Often however, we 
have auxiliary information from the data, such as measure- 
ments of pointing drifts, or changes in the detector and op- 
tical conditions, thought to be the cause of the instrumental 
systematics. This additional information should ideally be 
used to model systematic trends if available. 

Instrumental systematics have proved to be particularly 
problematic for HST/NICMOS observations of transmission 



spectra (e.g. Pont et al. 


2009 


Swain et al. 


2008 Gibson 


et al.|2011l. Swain et al. 


2008 


hereafter SVT08) observed 



and analysed the transmission spectra of HD 189733b, and 
interpreted the results as evidence for CH4 and H2O. Fol- 
lowing previous HST analyses, the instrumental systemat- 
ics in the light curves were modelled as a linear function 
of 'optical state parameters'. The optical state parameters 
are simply auxiliary measurements made directly from the 
spectra, such as the position, width and angle of the spec- 
tral trace, or other parameters reflecting the state of the 
detector and optics, such as the temperature and satellite 
orbital phase. |Gibson et al. | ( |2011| hereafter GPA11), re- 
viewed the evidence for molecular species in this dataset 
using similar methods to model the systematics. This study 
concluded that the transmission spectrum (and hence the 
detection of molecules) was too dependent not only on the 
functional form of the model, but also on the choice of which 
parameters to include. However, this was a rather unsatisfac- 
tory conclusion, as it did not provide an alternative method 
to determine the transmission spectrum, and we have since 
searched for more sophisticated techniques which can pro- 
vide a robust interpretation of these datasets. This paper 
introduces the use of Gaussian Process (GP) models to ad- 
dress this problem. 

GP models are widely used in the machine learning 
community ( jRasmussen fc Wi lliams 2006; Bishop 2006]) for 
Bayesian regression and classification problems, and have re- 



cently been adopted in other areas of astrophysics (e.g. Way 
|fc Srivastava||2li06"l |Mahabal et~aL][2008} |Way et al.||2009| ). 
Rather than impose a deterministic model, GPs define a dis- 
tribution over function space. This allows the instrumental 
systematics to be modelled in a non-parametric way. Using 
GPs, we can then marginalise out our ignorance of the sys- 
tematics model, and determine the posterior distribution of 
the parameters of interest, in this case the planet-to-star ra- 
dius ratio. To put it another way, the form of the systematics 



model is inferred from the data itself, and does not have to 
be set a priori. As GPs are a Bayesian technique, they also 
automatically avoid the problem of over-fitting. As we shall 
see later, through a sensible choice of prior distributions on 
our inputs variables, we may also determine which of the 
input variables are relevant to our dataset. 

This paper describes a GP model for inferring tran- 
sit parameters in the presence of instrumental systematics 
and correlated noise, when additional parameters are mea- 
sured from the data such as the optical state parameters 
mentioned above, but it can easily be generalised to time- 
correlated noise in the absence of such measurements. We 
provide a description of the GP model in Sect. [2] and its 
application to NICMOS transmission spectroscopy of HD 
189733 in Sect. [3] followed by our conclusions in Sect. [4] We 
also provide an appendix giving a brief introduction to GPs 
with simple regression examples. 



2 GAUSSIAN PROCESSES FOR 

TRANSMISSION SPECTROSCOPY 

A Gaussian Process (GP) is a non-parametric method for 
regression, used extensively for regression and classification 
problems in the machine learning community. A GP is de- 
fined as a collection of random variables, any finite number 
of which have a joint Gaussian distribution. A brief intro- 
duction to GPs is provided in the appendix, along with ref- 
erences for further reading. 

Here, we use a GP to specify a non-parametric model 
of the instrumental systematics for a transit observation, 
along with a deterministic transit model to infer the tran- 
sit parameters. We begin by defining the transmission spec- 
troscopy data set, and briefly re-stating the 'standard' linear 
model used in previous analyses, before introducing our GP 
model. 



2.1 Transmission spectroscopy data sets 

Reduced transmission spectroscopy data sets consist of mul- 
tiple transit light curves as a function of time. Here, we 
consider each wavelength channel independently, and store 
the N flux measurements in the vector / = (/1, • • • , /jv) T 
observed at time t — (t\, . . . , tjv) T . Often, additional pa- 
rameters (or optical state parameters) are measured which 
describe the behaviour of the instrument as a function of 
time. These are given by the vector x n — (a;„,i, ■ • ■ , £»i,A') T 
at time n, where K is the number of additional parameters. 
These are collected in the N x K matrix X = (asi, . . . , xn) T , 
and hereafter are called the input parameters. As each light 
curve is treated independently, our model can easily be ex- 
tended to single passband photometric transit observations, 
using additional measured inputs, or simply using time as 
the only input in order to specify time-correlated noise. 



2.2 Linear baseline model 

Often, the baseline flux, which is how the light curve would 
behave in the absence of a transit due to instrumental sys- 
tematics, is modelled as a linear combination of the input 
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parameters: 
/ = X/3 + e, 

where (3 — (/3i, . . . ,/3k) T is a vector containing the coeffi- 
cients of each input parameter, and e — (ei, . . . , ejv) T are the 
residuals, assumed to be independent Gaussian noise. The 
best-fit coefficients are then found by linear least squares for 
the out-of-transit data, given by 



/3 



(X T X) 



x x T /. 



The baseline flux is constructed for all data points using the 
best-fit coefficients, and the transit light curve is divided 
through by the baseline model to remove the instrumental 
systematics. The 'decorrelated' light curve is then fitted with 
a transit model to determine the parameters of interest, us- 
ing varying methods to propagate the uncertainties in the 
systematics model. 



2.3 Gaussian process model 

Rather than impose a simple, deterministic function on the 
instrumental model, we model each wavelength channel as 
a GP with a transit mean function: 

f(t,x)~gv (T(t,0),s(x,e)). 

Here, T is the transit function depending on t and the tran- 
sit parameters <f>, modelled using the analytic equations of 
Mandel fc Agol| ( |2002[ ). £ is the covariance matrix, which is 
a function of x and parameters 6, referred to as hyperparam- 
eters of the GFQ The definition of a GP means that the joint 
probability distribution for the finite set of observations / is 
a multivariate Gaussian distributed about a transit function 
T, with covariance S, given by 



p(/|X,0, 



: J V(r(t,<w,E(x,0)). 



(i) 



The instrumental systematics and noise are specified fully 
by the covariance matrix, which in turn is specified by a co- 
variance function or kernel. As the instrumental systematics 
are a function of the input parameters, we adopt the kernel: 



— k(x n , x m ) — £ exp 



This returns a scalar covariance for each pair of inputs, defin- 
ing each element of the covariance matrix, S nm . £ is a hy- 
perparameter that specifies the maximum covariance, and 
T] = (771 , . . . , tik) T are the inverse scale parameters for each 
input vector. To incorporate white noise, we add a variance 
term a 2 to the diagonal of the covariance matrix, where 5 is 
the Kronecker delta function. We hereafter use the notation 
— (f > T ? T ! o" 2 ) T to represent the covariance hyperparameter 
vector. 

The interpretation of this kernel is straightforward; data 
points that are nearby each other in input space are highly 
correlated, and data points that are far from each other in 
input space are relatively uncorrelated. This kernel there- 
fore describes a smooth function of the input parameters, 



1 Strictly speaking, the transit parameters are also hyperparame- 
ters of the GP, but here we refer to them as the transit parameters 
for simplicity. 



with the addition of white noise. The length scales determine 
how important each vector in input space is to determine the 
'closeness' of two data points, ft therefore enables Automatic 



Relevance Determination (ARD, Neal 1996 also known as 
shrinkage or ridge regression), where the r/i associated with 
each input parameter becomes small when the parameter is 
of little relevance to explain the data set. When r;^ — >• the 
parameter no longer influences the covariance matrix, and 
does not contribute to the model. In other words, the in- 
strumental systematics model assumes a similar value when 
all relevant input parameters are close. This kernel is suit- 
able for application to transmission spectroscopy datasets, 
as squared exponential kernels are suitable for processes with 
a dominant length scale in each input dimension. In trans- 
mission spectroscopy we are interested in instrumental sys- 
tematics within a narrow range of timescales, commensurate 
with the duration of the transit, therefore input parame- 
ters that vary in these timescales. Frequencies higher than 
the sampling can be treated as white noise, and frequen- 
cies lower can be modelled using a simple baseline function 
incorporated into the mean function. 

There are other kernel functions available that could 
be used to model such behaviour, in particular the Matern 
class. This can be seen as a 'rougher' generalisation of the 
squared exponential kernel, that results in less smooth func- 
tions of the inputs. However, this kernel would require an 
extra hyperparameter for each input parameter, and would 
likely prohibit marginalisation over all the hyperparame- 
ters, and perhaps allow too much freedom in the instrument 
model. As the squared exponential kernel defines a prior 
distribution over smooth function of the inputs, we can fur- 
ther justify the use of the this kernel by noting that this 
incorporates all common functions typically used in instru- 
ment models to account for systematics; for example, mod- 
els that are linear, quadratic, and higher order functions of 
the inputs, and thus represents a generalisation of previ- 
ous work. Indeed, this kernel reflects our prior beliefs about 
the properties of the data. The systematics are dominated 
by pixel-to-pixel sensitivity variations, which we would ex- 
pect to smoothly vary with the inputs, particularly when we 
consider that each wavelength channel is binned over many 
pixels. Nonetheless, readers should be aware that there are 
many covariance kernels available with different properties. 
The choice of kernel is an important consideration in apply- 
ing GP models, but the use of a non-parametric systematics 
model allows much greater flexibility than imposing a deter- 
ministic parametric form. 

We have now defined a distribution over smooth func- 
tions of the input parameters to model the instrumental 
systematics. The functional form that describes the base- 
line function may now be inferred from the data itself. As 
the joint probability of the observations defined by the GP 
is multivariate Gaussian (Eq. [T|, the log marginal likelihood 
function is written as 



|S|-^log(27r), (2) 



\ogC{r\X,0,<j>) = -\r T r-\ 

where r is a vector containing the residuals from the mean 
function: 

r = f-T(t,<j>). 

This defines the likelihood as a function of the mean func- 
tion parameters </>, which determine the underlying transit 
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light curve, and the hyperparameters 6, which determine 
the behaviour of the baseline function and the noise. 

In practice, it is convenient to place explicit priors on 
the maximum covariance hyperparameter and the scale- 
length hyperparameters (or hyperpnors) , to encourage their 
values towards zero if they are truly irrelevant to explain the 
data. In our model we use gamma hyperpriors (with shape 
parameter of unity), of the form: 

p(x) = j exp (-x/l) , 

for x ^ 0, and / is the length scale of the hyperprior. The log 
posterior distribution of the transit parameters and hyper- 
parameters is proportional to the sum of the log marginal 
likelihood and log hyperpriors, hence we add the log hyper- 
priors to Eq. [2] as follows: 



io g v(e, 4>\f, x) = log £(r|x, e, <j>) - i -£) 



+C, (3) 



for £, 6i 0, / are the scale lengths of the hyperpriors, and 
C represents additional constant terms. Note that we do not 
explicitly add hyperpriors for the transit parameters or vari- 
ance hyperparameter (implying uniform, improper hyperpri- 
ors). The hyperprior length scales are set to a large value to 
ensure the hyperpriors are non-informative. The challenge 
is now to infer the transit parameters and hyperparameters 
from the posterior probability distribution. 



2.4 Inferring transit parameters 

Now we have specified the log posterior as a function of 
the hyperparameters and the transit parameters (f>, it is 
straightforward (in theory) to infer the planet-to-star radius 
ratio p from the data. For example, to find the maximum 
posterior solution, the log posterior given by Eq. [3] is opti- 
mised with respect to the transit parameters and hyperpa- 
rameters. In a fully Bayesian treatment, we should obtain 
the posterior distribution for each parameter of interest by 
marginalising over all the other parameters of our model, 
i.e. all other mean-function parameters and hyperparame- 
ters. In practice this can be done using Monte-Carlo Markov- 
Chains (MCMC) methods, which are widely used in the ex- 
oplanet community to explore the joint posterior probability 



distribution of multivariate models (see e.g.|Collier Cameron 



et al.|2007||Holman et al.|2006||Winn et al.|2008| ). However" 
each time the hyperparameters are changed (at each step 
in a chain), we must recalculate the covariance matrix and 
its inverse in order to evaluate the log posterior. This scales 
badly with the length of the input time series, requiring 
0(N 3 ) operations. This often makes a full marginalisation 
over all hyperparameters intractable for large data sets, and 
at best slow to compute. It is therefore worth introducing 
an approximation known as type-II maximum likelihood. 

Type-II maximum likelihood involves fixing all the hy- 
perparameters at their maximum likelihood values, and 
marginalising over the remaining parameters of interest. In 
this case we will find the maximum posterior solution, and 
marginalise over the remaining transit parameters, although 
we still refer to this procedure as type-II maximum likeli- 
hood as in the literature. This is a valid approximation when 
the posterior distribution is sharply peaked at its maximum 



around the covariance hyperparameters. First, the log pos- 
terior function is optimised with respect to the hyperparam- 
eters and transit parameters <p, in order to determine the 
maximum posterior solution. This procedure still requires 
the covariance matrix is inverted, but much fewer times than 
for a full marginalisation over all parameters. The hyperpa- 
rameters are then held fixed at their maximum posterior 
values, and to obtain the (approximate) posterior distribu- 
tions for each parameter of interest, one then marginalises 
over the remaining transit parameters, which does not re- 
quire a matrix inversion at each step. One caveat is that we 
don't know if the posterior distribution is sharply peaked 
until we marginalise over all parameters. Consequently it is 
always preferable to marginalise over all hyperparameters 
of the CP when possible, and indeed, this is the final ap- 
proach used in this work. However, for many sets of data 
using the same kernel function, we might confirm the effec- 
tiveness of type-II maximum likelihood for a particular sub- 
set, and apply it to the remainder. Nevertheless, in many 
cases fixing the covariance hyperparameters might be a bet- 
ter approach than imposing a deterministic model of the 
systematics, but such approximations must be used with 
caution. In the following section, we apply our CP model to 
NICMOS transmission spectroscopy of HD 189733, and use 
both type-II maximum likelihood and a full marginalisation 
over all transit parameters and hyperparameters to obtain 
the transmission spectrum. Of course, results from the full 
marginalisation are preferred, but the comparison to type-II 
maximum likelihood is a worthwhile exercise. 



3 APPLICATION TO NICMOS 
OBSERVATIONS OF HD 189733 

3.1 The dataset 

Here we provide a summary of the HST/NICMOS observa- 
tions of HD 189733, used as a test case for our CP model. 
These were originally analysed by SVT08 and reanalysed by 
GPA11, and the dataset used in this paper is the same as 
that used in GPA11. For a more detailed account of the ob- 
servations and data reduction methods, we refer the readers 
to the aforementioned papers. 

A transit of HD 189733 was monitored on 25 May 2007, 
using the G206 grism covering the wavelength range 1.4 - 
2.5 [im. As HD 189733 is not in the continuous viewing zone 
of HST, the light curve was observed over 5 half-orbits, al- 
though orbit 1 is excluded as it shows much larger systematic 
effects attributed to spacecraft 'settling'. The light curves for 
18 wavelength channels were extracted for 638 spectra, along 
with optical state parameters. These are the position of the 
spectral trace along the dispersion axis (AX), the average 
position of the spectral trace along the cross-dispersion axis 
(AY), the angle the spectral trace makes with the x-axis 
(ip), and the average width of the spectral trace (W). The 
temperatur^] (T), and orbital phase (4>h) were also deter- 
mined for each image. 

In total, our dataset consists of a time-series of 519 flux 



2 No direct measurement of the temperature of the detector ex- 
ists, rather a the temperature of the NIC1 mounting cup is used 
as a proxy measurement, see GPA11 for details. 
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Figure 1. The 'raw' HD 189733 NICMOS dataset used as an example for our Gaussian process model. Left: Raw light curves of HD 
189733 for each of the 18 wavelength channels, from 2.47um to 1.49um top to bottom. Right: The optical state parameters extracted from 
the spectra plotted as a time series. These are used as the input parameters for our GP model. The red lines represent a GP regression 
on the input parameters, used to remove the noise and test how this effects the GP model. 



measurements (for orbits 2-5 only) in 18 wavelength chan- 
nels, and time-series of 6 optical state parameters. The 18 
light curves and the optical state parameters are shown in 
Fig. [1] The light curves show clear instrumental systemat- 
ics, and these must be accounted for when measuring the 
transit depth at each wavelength. We have strong reasons 
to suspect that the systematics in the light curves are re- 
lated to the optical state parameters in some way. This is a 
reasonable assumption, as changes in the position, angle and 
width of the spectrum can obviously have an effect on the 
flux collected in each wavelength bin, as can the tempera- 
ture of the detector. However, it is not possible to construct 
a reliable deterministic physical model which relates these 
parameters to the baseline flux, as it depends on each in- 
dividual pixel's sensitivity and response to temperature, as 
well as other complex changes in the optics. 

For each wavelength channel, the GP model outlined 
m Sect. [2] was applied, where / contains the 519 flux 
measurements, and X contain the input parameters, where 
x„ = (AX n , AY n , W„, ip n , T„, 4> n ) T ■ To test that the model 
was not influenced by the noise of the input parameters, 
a GP regression model with zero mean function was fitted 
to each input parameter using a squared exponential ker- 
nel, with time as the only input. The hyperparameters were 



optimised with respect to the likelihood function, and the 
regression is shown by the red lines in Fig. [l] The inference 
described in the following two sections was repeated using 
these de-noised input parameter^] We found that this had 
little influence on the results, and report the results using 
the noisy input parameters. We also checked that the choice 
of hyperprior length scale had little effect on the results, by 
setting them to large values, and repeating the procedure 
with varying length scales ensuring the transmission spec- 
tra were not significantly altered. 



3.2 Type-II maximum likelihood 



As mentioned in Sect. |2.4| a useful approximation is Type-II 
maximum likelihood. First, the log posterior is maximised 
with respect to the hyperparameters and variable mean- 
function parameters using a Nelder-Mead simplex algorithm 
(see e.g. Press et al. 1992 1. The transit model is the same 



as that used in GPA11, which uses Mandel fc Agol (20021 



models calculated assuming quadratic limb darkening and a 



3 We did not use the GP smoothed parameters for T, as we were 
not confident the fit reliably represented the underlying structure. 
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circular orbit. All non- variable parameters were fixed to the 



values given in Pont et al. (20081, except for the limb dark- 



ening parameters, which were calculated for each wavelength 
channel (GPA11, Sing 2010). The only variable mean func- 
tion variable parameters are the planet-to-star radius ratio, 
and two parameters that govern a linear baseline model; an 
out-of-transit flux f 00 t and (time) gradient T gra d. 

An example of the predictive distributions found using 
type-II maximum likelihood is shown in Fig.[2]for four of the 
wavelength channels. In this example, only orbits 2, 3 and 5 
are used to determine the parameters and hyperparameters 
of the GP (or 'train' the GP), and are shown by the red 
point^] Orbit 4 (green points) was not used in the training 
set. Predictive distributions were calculated for orbits 2-5, 
and are shown by the grey regions, which plot the 1 and 2er 
confidence intervals. The predictive distribution is a good fit 
to orbit 4, showing that our GP model is effective at mod- 
elling the instrumental systematics. The final systematics 
model will of course be even more constrained than in this 
example, as we use orbits 2-5 to simultaneously infer pa- 
rameters of the GP and transit function, in particular since 
orbit 4 is the has the most similar input parameters to the 
in-transit orbit (Fig. flj. 

Now that all parameters and hyperparameters are opti- 
mised with respect to the posterior distribution, the hyper- 
parameters are held fixed. This means the inverse covariance 
matrix and log determinant used to evaluate the log poste- 
rior are also fixed, and need to be calculated only once. An 
MCMC is used to marginalise over the remaining parame- 
ters of interest, in this case the planet-to-star radius ratio 
and the linear baseline coefficients. Our MCMC implemen- 



tation is adapted from Gibson et al. ( 2010 1, and uses the log 
posterior function given by Eq. |3j as the merit function. 

For each of the 18 wavelength channels, four separate 
chains of length 20 000 were computed, with the starting 
parameter set initialised to the maximum posterior distri- 
bution parameters with small perturbations applied to the 
variable parameters. During the burn-in phase (first 20% of 
each chain) , the global and relative stepsizes are adapted so 
that ~25% of proposal parameters sets are accepted. After 
burn-in is complete the chains have converged and now sam- 
ple from the posterior probability distribution. These sam- 
ples are used to estimate the marginalised posterior distribu- 
tions for the planet-to-star radius ratio. For each wavelength 
channel, convergence was confirmed for the four chains by 



calculating the Gelman & Rubin (GR) statistic (Gelman fc 
Rubin||l992 I, and in all cases it was within 1% of unity, a 
good sign of mixing and convergence. 

The resulting transmission spectrum consists of the 
planet-to-star radius ratio for each wavelength channel, and 
is shown in the left plot of Fig. [3] The dashed grey line shows 
the value of p for the 'white' light curve (i.e. the light curve 
given by the sum of all wavelength channels) reported in 
GPA11. 



4 We used the smoothed hyperparameters here for aesthetic pur- 
poses. Using the noisy hyperparameters would simply result in 
noisier 1 and 2<r limits, but with similar structure. 



3.3 Marginalisation over the hyperparameters 

Despite the need to invert a matrix at every calculation of 
the log likelihood function, it is still feasible to marginalise 
over all the parameters and hyperparameters of the GP 
model for the HD 189733 dataset using MCMC. The same 
MCMC routine as described in the previous section is used, 
this time allowing all the hyperparameters of the GP to vary 
along with the mean function parameters. The chains were 
initialised from the maximum posterior values, again with a 
small perturbation applied to each variable parameter. 

For each wavelength channel, four chains of length 
150 000 were calculated. As each evaluation of the poste- 
rior probability required a matrix inversion, each chain took 
about ~2.5 hours to run on a standard desktop computer. 
In order to run tests in a reasonable time frame, all 72 
chains were run in parallel using the Oxford e- Research Cen- 
tre supercomputing cluster. The chains were again checked 
for convergence using the GR statistic. For the majority of 
parameters, convergence within 1% of unity was achieved. 
However, some parameters did not converge as efficiently 
and the GR statistic was as much as ~ 5% from unity. This 
can be understood in terms of the degeneracy between in- 
put parameters. Given the inputs show similar structure, 
the same overall systematics model may be described using 
different combinations of scale length parameters. Also, in- 
put parameters that have small r\ values are not strongly 
constrained by the data (or priors) and effectively exhibit 
random walk behaviour. However, importantly the main pa- 
rameters of interest p did converge for all wavelength chan- 
nels. Correlation plots for two of the wavelength channels 
are shown in Fig. [4] 

For each of the wavelength channels, different input pa- 
rameters proved to be relevant to describe the instrument 
model as inferred from the values of the r\ parameters. The 
phase is a consistently relevant input parameter for the ma- 
jority of wavelength channels, and the temperature is not 
particularly relevant for any channels, likely due to the mea- 
surement being particularly noisy and only a proxy measure- 
ment. Of course the true temperature of the detector may 
still have a significant influence of the light curves, but un- 
fortunately this is not available. The other parameters tend 
to vary in their influence, in part due to the degeneracy 
discussed earlier. SVT08 and GPA11 recognised the angle 
as the most important input parameter. This at first might 
seem contradictory to our results, but in this case impor- 
tance was judged on the overall (visual) change in features 
of the outputted transmission spectrum. Here we judge it on 
the hyperparameter values from the posterior distribution. 
Indeed, parameters that are only relevant for a subset of the 
wavelength channels may in fact have the greatest impact 
on any features in the transmission spectrum. 

The resulting transmission spectrum is shown in the 
right plot of Fig. [3] Again the dashed line shows the white 
light curve planet-to-star radius ratio from GPA11. The 
spectrum is consistent with the type-II maximum likeli- 
hood results, the main difference being that the uncertain- 
ties are larger. Therefore, in this case it was important to 
marginalise over all the hyperparameters to obtain good es- 
timates of the uncertainties. Given the convergence was not 
perfect, we ran the same MCMC procedure several more 
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Figure 2. Examples of type-II maximum likelihood regression for four light curves of HD 189733. The red points show those data used 
in the training process (orbits 2, 3 and 5), and the green points show the remaining data (orbit 4). The predictive distributions arc shown 
by the grey regions, which mark the 1 and 2 a confidence intervals. The GP model predictions are consistent with the measured data, 
indicating that our GP model is an effective model of the instrumental systematics. In practice all four orbits are used to model the light 
curve and systematics, and infer the planet-to-star radius ratio, hence the systematics model will be even better constrained than shown 
here. 



times. The transmission spectrum and uncertainties pro- 
duced were almost indistinguishable. 

This spectrum is also consistent with the results of 
SVT08 and GPA11, when using the simple linear basis mod- 
els. However, the uncertainties are much larger, and do 
not provide strong evidence for the molecular features re- 
ported in SVT08. The central 'feature' in the spectrum at 
~ 2.1 u.m is no longer significant. Bluewards of ~ 1.7 (J.m the 
same dip appears, but with lower significance. We empha- 
sise that if the simple linear basis functions provided a good 
explanation for the systematics, our GP model would have 
reproduced the same transmission spectrum (and uncertain- 
ties), as the GP marginal likelihood function favours long 
length scales (i.e. slowly varying functions) if they are suf- 



ficient to explain the instrumental systematics. Therefore, 
we strongly encourage anyone wishing to carry out compar- 
isons with theoretical models to use the more conservative, 
but robust transmission spectrum determined in the present 
work using the GP model. Not doing so would entail the risk 
of over-interpreting features in the spectrum. These results 
are provided in Tab. [I] 

The dip at the left hand side of the spectrum could in- 
deed be the result of a drop in molecular absorption in the 
atmosphere. However, we warn the reader that this region 
of the spectrum is where the flux is lowest, and therefore 
is susceptible to systematics from poor background subtrac- 
tion as discussed in GPA11. This has the effect of stretching 
or compressing the transit light curve in the flux axis, and 
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Figure 3. NICMOS transmission spectra of HD 189733 produced from the GP model. The horizontal dashed line represents the planet- 
to-star radius ratio determined from the white light curve. Left: using the maximum likelihood type-II approximation (see Sect. [3~2] ) . 
Right: obtained by marginalising over all other parameters and hyperparameters (see Sect. [373|l. 



Table 1. NICMOS transmission spectrum of HD 189733 from 
our GP model, giving the wavelength, planet-to-star radius ratio 
p and its uncertainty Ap. The wavelengths are displayed red to 
blue, so that they correspond to the plots in Fig. [I] 



Wavelength 
(urn) 




p 


Ap 


2.468 





15545 


0.00077 


2.411 





15520 


0.00052 


2.353 





15455 


0.00044 


2.296 





15513 


0.00057 


2.238 





15512 


0.00041 


2.181 





15504 


0.00051 


2.124 





15417 


0.00066 


2.066 





15508 


0.00066 


2.009 





15393 


0.00036 


1.951 





15549 


0.00051 


1.894 





15595 


0.00060 


1.837 





15513 


0.00053 


1.779 





15534 


0.00051 


1.722 





15447 


0.00087 


1.665 





15429 


0.00064 


1.607 





15266 


0.00062 


1.550 





15359 


0.00073 


1.492 





15367 


0.00118 



therefore the dip could be the result of systematics remain- 
ing from the data reduction. This raises an important point; 
GP models, or indeed any sophisticated models, can still suf- 
fer from systematic effects not removed from the data sets 
at the initial reduction stage. A more detailed interpreta- 
tion of the HD 189733 spectrum will form the subject of a 
later paper (Gibson et al., in prep), combined with WFC3 
near-infrared observations that overlap the blue region of 
the NICMOS spectrum, and will shed some light on the in- 
terpretation of this feature. 



4 SUMMARY AND DISCUSSION 

We have introduced Gaussian Processes as an alternative to 
deterministic models for the modelling and removal of sys- 
tematics in the presence of auxiliary input parameters. GPs 
provide a powerful Bayesian approach to place distributions 
over functions. Rather than impose a parametric model of 
instrumental systematics, by using GPs one can marginalise 
out their ignorance of the functional form of the systematics 
model. GPs allow an arbitrary number of additional input 
parameters, and a framework to determine which parame- 
ters are important to the analysis. 

We demonstrated the application of our GP model 
to NICMOS transmission spectroscopy of HD 189733. The 
transmission spectrum is consistent with those found in 
SVT08 and GPAU using simple linear basis functions, al- 
though with larger uncertainties. This is because the linear 
basis functions are not sufficient to explain the instrumental 
systematics, and therefore do not provide realistic treatment 
of the uncertainties. Detailed interpretation of these results 
will be included in a later paper. 

Our GP model may also be used in the absence of aux- 
iliary information, by using a time dependant covariance 
kernel to model time-correlated noise in exoplanet tran- 
sits. However, due to the considerable computation time re- 
quired, existing methods, in particular the wavelet method 
of Carter & Winn|(|2009|, are considerably faster where ap- 



plicable for large transit data sets, and are preferable in the 
absence of additional inputs, although limited to regularly 
sampled data. Comparing results of both methods for time- 
correlated noise will provide useful tests, and may form the 
subject of future work. 

Recently, |Waldmann| ( |2011[ ) proposed another non- 
parametric method for removing systematics from transmis- 
sion spectroscopy datasets, based on Independent Compo- 
nent Analysis. This uses a completely different approach, 
and searches for common signals in multiple transit light 
curves, used to remove the systematics from the data. These 
new methods based on non-parametric models should pro- 
vide effective new tools to robustly extract signals in the 
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Figure 4. Posterior distributions of the variable transit parameters and hyperparameters from the MCMC chains for two of the 
wavelength channels, shown in the lower and upper triangles. The scatter plots show all pairs of parameters plotted after marginalisation 
over all other parameters, and the histograms show the marginalised posterior distribution of each individual parameter. The different 
colours represent the four separate MCMC chains. The lower triangle one shows one of the better converged wavelength channels, and 
the upper one shows one with some poorly converged hyperparameters. 



presence of unknown systematic noise, and the development 
of such methods are important to future exoplanet research. 
It would be interesting to compare results from our GP 
model to those using the approach proposed by j Waldmann| 
( |2011[ ) , as the two methods may prove complementary, given 
that they make different assumptions about the systematics. 

GP models provide a general framework that can be ap- 
plied to many different problems in regression, interpolation 
and prediction where a deterministic function is not avail- 



able. The major limitation to applying GP models is that 
each evaluation of the posterior probability requires a ma- 
trix inversion, which scales badly with the size of the data 
set. This is unavoidable for datasets which require marginal- 
isation over function space. This difficulty may be some- 
what eased in the near future by using sparse GP methods 



(e.g. |Quinonero-Candela fc Rasmussen||2005 Walder et al 



2008), which approximate the covariance matrix as sparse, 
or using utilising rank-reduction methods for large matrices 
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where appropriate (e.g. Foster et al.|2009 Way et al.|2 009), 
and/or by using GPUs to vastly speed up matrix inversion 



(e.g. Volkov & Demmcl 2008). Indeed, ongoing research in 
this area should open up the application of GPs to many 
more astronomical datasets. 
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APPENDIX A: GAUSSIAN PROCESSES FOR 
REGRESSION 

Gaussian process (GP) models are extensively used in the 
machine learning community for Bayesian inference in non- 
parametric regression and classification problems. They can 
be seen as an extension of kernel regression to probabilis- 
tic models. This appendix aims to give a brief introduction 
to GPs used for regression problems. Our explanations are 
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based on the textbooks by Rasmussen & Williams ( 2006 1 



and Bishop ( 2006 \ , where the interested reader will find 



more complete and detailed information. 



Al Introducing Gaussian Processes 

We first define a collection of TV observations, consisting 
of independent variables x n and observed values y n . Here 
x n — (a?i,n, • • • , id,h) T is a D-dimensional input vector, and 
y„ is the corresponding scalar outpulQ We arrange the TV 
inputs into an TV x D matrix X = (asi, . . . , £Cjv) t , and the TV 
observed outputs into a vector y — (j/i, . . . , j/iv) T - 
A standard approach is to model data as 

y = m(X, 4>) + e, 

where m is the mean function with parameter vector <p, and 
e represents independent and identically distributed (i.i.d.) 
Gaussian noise: 

p(e) =Af{0,o- 2 \). 

Here TV represents a multivariate Gaussian distribution with 
mean vector and covariance matrix a 2 \, where a 2 is the 
variance of a single observation and and I is the TV x TV 
identity matrix. The joint probability distribution for a set 
of outputs y is therefore 

P(V I X, 4>, a) = TV(m(X, </>), <r 2 l). 

For example to represent photometric observations of a plan- 
etary transit, y would be the measured flux, typically the 
input matrix X contains only time, and <p are the transit 
parameters. Thus the above expression is the standard like- 
lihood function often used for transit light curve fitting. This 
could be extended to include additional inputs in the ma- 
trix X, such as auxiliary data pertaining to the state of the 
instrument (e.g. detector temperature) and observing con- 
ditions (e.g. airmass). 



Formally, Rasmussen & Williams ( 2006 ) define a GP as 



a collection of random variables, any finite number of which 
have a joint Gaussian distribution. GPs are often written as 

y(x) ~QV (m(as),E), 

where S is the covariance matrix, and we have dropped the 
mean function parameters from the notation for simplicity. 
A GP only has two parameters: the mean and covariance. 
The parameters of the mean function and covariance func- 
tion, <p and 0, respectively, are the hyperparameters of the 
GP. 

As any finite collection of observations has a joint Gaus- 
sian distribution, the joint probability distribution of y is 
given by 



p(y\X,ct>,0) =JV(m(X),£) 



(Al) 



We can see that assuming an i.i.d noise model is in fact a 
special case of a GP with a diagonal covariance matrix. In a 
GP model, each element in the covariance matrix is defined 
by the kernel or covariance junction which has hyperparam- 
eters and is given by 



= fe(a 



where the kernel maps the input vectors a; to a scalar co- 
variance. The specification of a covariance function implies 
a probability distribution over functions, thus a GP can be 
viewed as a distribution over functions. To illustrate this we 
will adopt a commonly used covariance function, the squared 
exponential, given by 



k{x n , x Tt 



>exp 



where 8o is the maximum covariance and 9\ is an inverse 
length scale parameter (this becomes more obvious if we 
write 8i — 1/21 2 to get an unnormalised Gaussian). As there 
is only one hyperparameter governing the length scale for all 
dimensions, this is called an isotropic kernel. 

The GP now has a simple interpretation; data points 
that lie near each other in input space are highly correlated, 
and data points which are distant in input space are rela- 
tively uncorrelated. We can draw samples from the GP by 
choosing a number of inputs, calculating the corresponding 
covariance matrix, and generating a random Gaussian vec- 
tor from the multivariate distribution. The top two plots in 
Fig. |A1| show examples of three vectors drawn at random 
from a GP with a 1 dimensional input space, with a squared 
exponential kernel function, and zero mean function. The 
dark and light grey shaded areas represent the 1 and 2a 
boundaries of the prior distribution, respectively. The left 
plot has hyperparameters {#o,#i} = {25,1/8} (I = 2), 
and the right plot has hyperparameters {6o,9i} = {25,2} 
(I — 0.5). These sample vectors represent random smooth 
functions drawn from the prior (specified by the hyperpa- 
rameters). The longer inverse length scale means that the 
functions on the right plot vary more quickly. 

A GP with a squared exponential kernel function there- 
fore defines a distribution over smoothly varying functions 
y(x). The 9q hyperparameter determines the maximum vari- 
ation of the function, and the inverse length scale 6\ deter- 
mines how quickly varying the function is; in other words 
how far input values need to be apart in order to be- 
come uncorrelated. It can be shown that GP regression 
with the squared exponential kernel function is equivalent 
to Bayesian linear regression with an infinite number of ba- 
sis function^] 

The mean function of the GP controls the determinis- 
tic component of the model, whilst the covariance function 
controls its stochastic component. Multiple covariance func- 
tions can be added or multiplied together to model multiple 
stochastic processes (as they are still valid covariance func- 
tions). For example, most GP models include an i.i.d com- 
ponent. To incorporate this, we simply add a term to the 
diagonal of the covariance matrix. The covariance function 
is then given by 



5 It is straightforward to extend this model to multi-dimensional 
outputs, but for simplicity we consider only scalars. 



k(x n , x m ) = O exp ( —6x y^(a?j, n - x itm ) 2 j + 8 nm a 2 , 

where 5 is the Kronecker delta function, and a 2 is a new hy- 
perparameter representing the variance of the white noise. 



6 In practice we need only consider the function values for a finite 
set of inputs making it possible to work in the infinite space of 
basis functions. This is impossible using standard basis function 
models. 
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Therefore a GP with this covariance function now defines a 
distribution over smoothly varying functions with the addi- 
tion of white noise. 

The joint probability distribution of the GP given by 
Eq. | Al| is the likelihood of observing y, given the inputs X, 
mean function parameters <j> and hyperparameters 0. This 
likelihood is already marginalised over all possible realisa- 
tions of the stochastic component that satisfy the covari- 
ance matrix (or all possible functions in the distribution 
specified by the GP) and hence is commonly referred to as 
the marginal likelihood. In this sense, GPs are intrinsically 
Bayesian and help to mitigate against the problem of over- 
fitting associated with non-Bayesian methods. 

A2 Gaussian Process regression 

Given a set of N observed data points y with correspond- 
ing inputs X (in machine learning literature known as the 
training data), we can use a GP to make predictions for ad- 
ditional input values given by x+ (known as the test data). 
We want to obtain the predictive distribution for the output 
y* conditioned on the training data. According to the prior, 
the joint probability distribution of the training outputs y 
and the test output is Gaussian and given by 



y 

y* 



--N 



rrc(X) 
mix*) 



s 



where £ is the covariance matrix for the training data 
points, fc* is the column vector with elements k(x n ,x*) for 
n = (1, . . . , N), and c is the scalar given by fe(as*, a;*) + a 2 . 

The predictive distribution of is then obtained by 
conditioning on the observed data points. Using standard 
results the conditional joint posterior distribution is another 
Gaussian: 

V {v*\ x+,y, X, 6,4>)=N (y*,a 2 ) , 

where y t and a 2 are the mean and variance of the distribu- 
tion, respectively, given by 

y* = m(x i ,) + fcjf S _1 r, 



and 



and r = y — m(X) is the vector containing the residuals from 
the mean function. 

These results may easily be extended to make predic- 
tions for AT* test inputs given by X*. If we define K* as the 
N x N+ matrix containing the covariance for all pairs of 
training inputs and test inputs, and K** as the N+ x iV* co- 
variance matrix of the test inputs, the distribution is again 
Gaussian with mean y and covariance matrix C, given by 

y yt = m(X*) + K^S- 1 r I 

and 

C = K** K^S 1 K*. 



Fig. |A1| gives simple examples of GP regression con- 
ditioned on some artificial training data, shown by the red 
points. The top plots show the prior probability distribution 
and random vectors drawn from it, as described in previous 
section. The lower plots show the distribution conditioned on 
the training data, with the mean indicated by the thick black 



line, surrounded by the 1 and 2a predictive distributions. 
These use the same hyperparameters as the corresponding 
plots above, with an additional white noise hyperparameter 
equal to 1. We can think of this process as a weighted av- 
erage of the functions generated by the prior, weighted on 
their likelihood function, or mathematically, marginalising 
over function space. The predictive distribution therefore 
becomes narrower as more of function space is restricted. 
Note that with the longer length scale the distribution is 
narrowed along most of the x range plotted, whereas with 
the shorter length scale the distribution quickly resorts to 
the prior distribution far from any observed data. The plot- 
ted distributions include white noise for the training data, 
and the three coloured lines are random functions drawn 
from the posterior distribution^] Clearly the value of the hy- 
perparameters has a large impact on the regression; luckily 
it is possible to infer these from the training data. 

A3 Learning the hyperparameters 

GPs are extremely effective for interpolation and prediction, 
and indeed this is one of their main applications in the ma- 
chine learning community. However, GPs can also be used to 
infer distributions of the mean function parameters and/or 
hyperparameters, either of which may be the quantities of 
interest. We consider the marginal likelihood function from 
the previous sections: 

U (27r)«/2jE| 1/2 V 2 

where we have now explicitly written the Gaussian function. 
It is convenient to use the log likelihood: 



log£(r|X,0,0) = -ir T £- 



i 1 , 
r - - log 



SI - 



;(2tt) 



Ideally, in a Bayesian framework one should obtain the pos- 
terior probability distribution of each hyperparameter of in- 
terest by marginalising over all the others. This can be done, 
for example, using a sampling technique such as Markov 
Chain Monte Carlo. However, this is expensive to compute 
as each evaluation of the likelihood function we must invert 
the N x N covariance matrix, requiring 0(n 3 ) operations. 

One alternative is to make an approximation known as 
type-II maximum likelihood. This involves fixing the covari- 
ance hyperparameters at their maximum likelihood values, 
and marginalising over the remaining mean function hyper- 
parameters of interest. This is a valid approximation when 
the posterior distribution is sharply peaked at its maximum 
with respect to the covariance hyperparameters, and should 
give similar results to the full marginalisation. 

Fig. |A2| gives examples of GP regression using a squared 
exponential covariance function, after optimising the hyper- 
parameters via a Nelder-Mead simplex algorithm (see e.g. 
Press et al.|1992 l. The training data (red points) were gen- 
erated from various functions (shown by the green dashed 
line) with the addition of i.i.d noise. All were modelled using 
a zero-mean function, with the exception of the transit func- 
tion at the bottom left, which used a transit mean function. 

7 When plotting the distribution we need only consider the diag- 
onal of the matrix K**, whereas when generating random vectors 
we must consider the full covariance matrix of the test inputs. 
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Figure Al. Top: Examples of random functions drawn from a GP with a squared exponential covariance kernel. The left and right plots 
have hyperparameters {80, 9±} = {25,2}, and {8g, #1} = {25,0.5}, respectively. The dark and light grey shaded areas represent the 1 
and 2 a boundaries of the distribution. Bottom: The predictive distributions generated after adding some training data and conditioning 
using the same hyperparameters as the corresponding plots above, with the addition of a white noise term (<r 2 = 1). The coloured lines 
are now random vectors drawn from the posterior distribution. 



In the top left the data were generated from a straight line 
function. The marginal likelihood strongly favours functions 
with a longer length scale, and the GP is able to predict the 
underlying function both when interpolating and extrapolat- 
ing away from the data. The top right data was generated 
from a quadratic function, and again the GP reliably inter- 
polates the underlying function. However, as the length scale 
is now shorter, the predictive distribution veers towards the 
prior more quickly outside the training data. These two ex- 
amples show that GPs with squared exponential kernels are 
very good at interpolating data sets providing the data is 
generated from a smooth function. The bottom left plot 
shows data drawn from a transit mean function with sinu- 
soidal and Gaussian time-domain 'systematics' added. The 
GP model has a transit mean function, and the additional 
'systematics' are modelled by the GP. Finally, the bottom 
right plot shows data drawn from a step function. In this 
case the GP interpolation is less reliable. This is because 
the squared exponential function implies smooth functions. 
A more appropriate choice of covariance kernel is required 
for such functions (e.g. |Garnett et al.|2010[ ). Indeed, a wide 
choice of covariance kernels exist which effectively model 



periodic, quasi-periodic and step functions (see Rasmussen 



|fc Williams||2006| , enabling GPs to be a very flexible and 
powerful framework for modelling data in the absence of a 
parametric model. 
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Figure A2. Examples of GP regression using type-II maximum likelihood. The dashed green lines represent the functions that the 
training data were drawn from, before adding Gaussian i.i.d. noise. All data were fitted using a GP with squared exponential kernel, 
and all used a zero mean function with the exception of the bottom left plot, which used a transit mean function. The solid grey and 
black regions represent the mean, and 1 and 2<r confidence intervals. Clockwise from top left, the function used to generate the data is: 
a straight line, a quadratic function, a step- function, and a planetary transit model with the addition of correlated 'noise'. See text for 
discussion. 
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